DIAC+: a Professional Diacritics Recovering System

Authors

  • Dan Tufis
  • Alexandru Ceausu
Abstract

In languages that use diacritical characters, stripping these signs from a word often yields a string that does not exist in the language, so its normative form is, in general, easy to recover. This is not always the case, however: when a word exists both with and without a diacritical sign on a base letter, the presence or absence of that sign may change its grammatical properties or even its meaning, which makes recovering the missing diacritics difficult, not only for a program but sometimes even for a human reader. We describe and evaluate an accurate knowledge-based system for automatically recovering missing diacritics in MS Office documents written in Romanian. In the rare cases where the system cannot make a reliable decision, it either presents the user with a list of words and their recovery suggestions, or chooses probabilistically among the possible changes and leaves a trace (a highlighted comment) on each word whose modification was uncertain.
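The distinction the abstract draws between unambiguous and ambiguous recovery can be illustrated with a minimal Python sketch. This is not the authors' knowledge-based method: it swaps in a plain frequency lookup, and the lexicon, its counts, and the names `LEXICON`, `strip_diacritics`, `Restoration`, and `restore` are all hypothetical. Input is assumed to be lowercase.

```python
import unicodedata
from collections import defaultdict
from dataclasses import dataclass, field

# Hypothetical toy lexicon: normative word form -> made-up corpus count.
LEXICON = {"și": 120, "țară": 30, "fata": 10, "fată": 25, "față": 40, "mama": 20}


def strip_diacritics(word: str) -> str:
    """Decompose to NFD and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Index: stripped form -> every lexicon form that collapses onto it.
INDEX = defaultdict(list)
for form in LEXICON:
    INDEX[strip_diacritics(form)].append(form)


@dataclass
class Restoration:
    word: str                      # chosen output form
    uncertain: bool = False        # True when several forms were possible
    suggestions: list = field(default_factory=list)


def restore(word: str) -> Restoration:
    """Restore diacritics for one lowercase word (assumption: no case handling)."""
    candidates = INDEX.get(strip_diacritics(word), [])
    if not candidates:
        # Unknown word: leave it unchanged.
        return Restoration(word)
    if len(candidates) == 1:
        # Only one normative form exists, so recovery is deterministic.
        return Restoration(candidates[0])
    # Several valid forms: pick the most frequent one, but flag the choice
    # and keep the ranked alternatives so a user could review them.
    ranked = sorted(candidates, key=LEXICON.get, reverse=True)
    return Restoration(ranked[0], uncertain=True, suggestions=ranked)
```

With this toy data, `restore("si")` has a single candidate and is returned without a flag, while `restore("fata")` matches three lexicon forms and comes back marked as uncertain together with its ranked suggestions, loosely mirroring the "leave a trace" behaviour described above.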


Similar resources

Diacritics Restoration in Romanian Texts

There are several languages that use diacritical characters outside the ASCII charset. For some of the languages, most diacritical characters can be deterministically recovered but in general, this is not the prevailing case. However, the difficulty of the task differs from language to language depending on the functional role of the diacritical characters. For Romanian, automatic restoration o...


Automatic diacritization of Arabic transcripts for automatic speech recognition

Arabic orthography does not provide full vocalization of the text, and the reader is expected to infer short vowels from the context of the sentence. Inferring the full form of a word is useful when developing Arabic speech and language processing tools, since it is likely to reduce ambiguity in these tasks. In this paper, we present generative techniques for recovering vowels and other diacrit...


Reconstruction of Polish diacritics in a text-to-speech system

This paper describes an approach to reconstruction of the Polish diacritic signs, needed e.g. in a speech synthesis system. Some telecommunication services (for example SMS transmission in GSM) remove diacritics from the text. Without them the text is usually still understandable to a reader, but if a TTS system reads it, the speech becomes heavily distorted. In this paper we propose to use neu...


Instant Diacritics Restoration System for Sindhi Accent Prediction using N-Gram and Memory-Based Learning Approaches

The script of the Sindhi language is highly complex, in part because of its abundance of homographic words. Interpreting a text becomes difficult because a homographic word can carry multiple meanings unless its pronunciation is specified with diacritics. Diacritics help readers comprehend the text easily. Due to the rapidly developi...


A robust diacritics restoration system using unreliable raw text data

Statistical language models are utilized in many speech processing algorithms, e.g., automatic speech recognition (ASR). Such a model is created from a text corpus, but many of the text corpora for Romanian are unreliable with respect to the use of diacritic marks, i.e., diacritics are either partially or completely missing, resulting in low quality language models. We present a methodology for...




Publication date: 2008